Contract.info Content
Contract ID
Project Name
Report ID hw.animal_plant_ont
Report Date 2024-07-17

1 Background

A complete and accurate genome sequence is essential for genomic studies of new species and for investigating complex structural changes in wild relatives compared with published cultivar genomes. Long-read sequencing technologies, namely single-molecule real-time sequencing (Pacific Biosciences) and nanopore-based sequencing (Oxford Nanopore Technologies), can sequence long DNA fragments and produce long reads (10-1000 kilobases) that can be used in de novo whole-genome assembly and pan-genome analysis. Together with high-quality assemblers, these technologies are rapidly advancing genomics research through improved reference genomes and more comprehensive variant identification. For example, they have been successfully applied to decode the 32-Gb axolotl genome.

Because second-generation sequencing produces reads of a few hundred nucleotides at most, de novo assemblies based solely on short reads have limitations: they may misrepresent large proportions of the genome by missing important genes or collapsing repeat-rich regions. In contrast, genome assemblies based on PacBio or Nanopore reads offer substantially higher contiguity and completeness. Contiguity is typically measured by metrics such as contig N50, which can be over 100 times larger for long-read assemblies than for short-read-only assemblies. A high-quality assembly supports many lines of study of a species: functional genes can be identified and gene family evolution analysed; repeat sequences, regulatory elements and other important genomic features can be detected; the assembly can serve as a reference for read mapping in analyses of diversity and agronomic traits; and transposase-accessible chromatin, topologically associating domains and other large-scale chromosome features can be studied.

1.1 Nanopore Platform Advantages

  1. Long reads from third-generation sequencing can span highly repetitive and low-complexity regions, which helps with problems that are difficult to resolve with short-read NGS, such as the assembly of highly heterozygous genomes and highly repetitive sequences. This allows a more accurate and complete reconstruction of complex genome structures and variations.
  2. A key advantage of Nanopore sequencing is the ability to sequence DNA or RNA molecules directly, without PCR amplification. This avoids the biases introduced by PCR and gives a more accurate representation of the original genetic material. In addition, third-generation sequencing platforms, including the PromethION, show markedly less GC bias than amplification-based methods, which allows a more accurate assessment of GC-rich and GC-poor regions of the genome. Reading native DNA or RNA molecules also enables the detection of epigenetic modifications, such as DNA methylation, without additional sample preparation steps.
  3. Compared with other sequencing technologies, Nanopore sequencing has a relatively low sequencing cost. One reason is the simplicity of sample preparation and of the sequencing chemistry: the sequencing reaction itself does not rely on DNA polymerase or dNTPs, which reduces reagent cost. In addition, Nanopore sequencing can be performed on relatively simple equipment and does not require specialized sample handling procedures, which also contributes to its lower cost. The exact cost varies with the specific application and the sequencing platform used.

1.2 Nanopore Sequencing Principle

Nanopore technology uses electrical signals to sequence single DNA/RNA molecules, inferring the base composition from changes in current as the molecule passes through a biological nanopore. The core component is a protein-based, nanometer-scale pore, known as a “Pore”, which is inserted into a thin membrane with high electrical resistance; the membrane is immersed in an ion-containing aqueous solution on both sides. A voltage is applied across the membrane, causing ions to move through the protein pore from one side to the other and producing an electric current through the pore. When a single-stranded DNA/RNA molecule passes through the pore, it obstructs the flow of ions. Because the bases differ in their chemical properties, each produces a characteristic change in the electrical signal as it passes through the nanopore channel. By detecting and comparing these signals, the corresponding base types can be determined.

The embedded protein on the membrane, also known as a Reader protein, is the core of the entire sequencing chip and is typically a naturally occurring transmembrane protein that can be incorporated into the membrane.

During sequencing of a double-stranded DNA library, a DNA helicase unwinds the double helix into two single strands so that one of them can pass through the protein pore. This helicase is also known as the Motor protein: while attached to the nanopore protein, it controls the movement of the single-stranded DNA through the nanopore at a speed of about 450 bases per second.

Figure 1.1 Nanopore sequencing technology

2 Library Construction and Sequencing Process

A 1D library is constructed for sequencing on the high-throughput PromethION platform. The term “1D” refers to the complete separation of the forward and reverse strands in the library, which are sequenced separately during the sequencing process, similar to the libraries constructed for Illumina sequencing. The library is constructed using standard adapter ligation methods: optional DNA fragmentation, end-repair, A-tailing, and ligation of sequencing adapters carrying the motor protein and tether protein.

The library construction process is shown in Figure 2.1.

Figure 2.1 Workflow of library construction

3 Bioinformatic Analysis Workflow

The Nanopore data analysis workflow is depicted in the diagram below:

Figure 3.1 Nanopore data analysis pipeline

4 Standard Analysis

4.1 QC Results and Performance

The original nanopore signal obtained from the high-throughput sequencing platform is recorded in FAST5 format. Dorado is used for base calling, converting the raw signal to FASTQ format, which contains the sequences (reads) and the corresponding sequencing quality information. Quality control of the raw reads is conducted with NanoPlot, and adapter contamination and low-quality reads are removed. Data statistics are summarized in the table below:

Table 4.1 Nanopore data statistics

  • Sample: Sample ID.
  • Total bases(G): Number of original sequenced bases.
  • Q>7 bases(G): Number of PASS(Q>7) bases.
  • Reads number: Sequenced read count.
  • Q>7 Reads number: Number of reads with Q>7.
  • Reads mean length: Average read length.
  • Reads length N50: With reads sorted from longest to shortest and their lengths accumulated one by one, the length of the read at which the cumulative length first reaches 50% of the total length.
  • Reads mean quality: Average quality value of all pass reads.
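The length-based statistics in Table 4.1 can be reproduced with a few lines of code. The following is a minimal sketch, not the pipeline's actual implementation: the function names and the input layout (one `(length, mean_quality)` pair per read) are assumptions made for illustration, and the Q>7 threshold is taken from the legend above.

```python
from typing import Dict, List, Tuple


def read_n50(lengths: List[int]) -> int:
    """Return the read length N50: the length of the read at which the
    cumulative length, accumulated from longest to shortest, first
    reaches 50% of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize_reads(reads: List[Tuple[int, float]],
                    min_q: float = 7.0) -> Dict[str, float]:
    """Summarize reads given as (length, mean_quality) pairs.

    The input layout is hypothetical; a read counts as PASS when its
    mean quality is strictly greater than min_q."""
    lengths = [length for length, _ in reads]
    passed = [length for length, quality in reads if quality > min_q]
    return {
        "total_bases": sum(lengths),
        "pass_bases": sum(passed),
        "reads_number": len(lengths),
        "pass_reads_number": len(passed),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "length_n50": read_n50(lengths),
    }
```

Dividing the base counts by 1e9 gives the values in gigabases (G) used in the table.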

4.2 Read Statistics

Figure 4.1 Read length distribution

The horizontal axis represents read length, and the vertical axis represents the number of reads.

The figure below depicts the relationship between read length and read quality.

Figure 4.2 Density map of read length and average quality.

The horizontal axis represents read mean quality, and vertical axis represents read length.

Results are stored in the following path: Result/01.LRQC

4.3 Mapping

4.3.1 Statistics of Reference Genome

Table 5.1 Statistics of reference genome

  • Reference: Reference name.
  • Seq number: Total number of the assembled sequences or scaffolds.
  • Total length: Total length of assembled genomic sequence.
  • GC content(%): GC content of the reference genome.
  • Gap rate(%): The percentage of unknown nucleotides (N) in the reference genome assembly.
  • N50 length: The scaffold N50 length; scaffolds of this length or longer contain at least 50% of the total assembly length.
  • N90 length: The scaffold N90 length; scaffolds of this length or longer contain at least 90% of the total assembly length.
  • Total length NoN: Total length of assembled genomic sequence without N base.
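The reference statistics listed above follow directly from the sequences themselves. Below is a minimal sketch under stated assumptions: the sequences are already loaded into a dictionary of name to sequence string (FASTA parsing is omitted), the function names are hypothetical, and GC content is computed over non-N bases, which may differ from the convention used to produce Table 5.1.

```python
from typing import Dict, List


def nx_length(lengths: List[int], fraction: float) -> int:
    """Return the Nx length (fraction=0.5 for N50, 0.9 for N90):
    sequences of this length or longer together contain at least
    the given fraction of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0  # empty input


def reference_stats(seqs: Dict[str, str]) -> Dict[str, float]:
    """Compute summary statistics for a reference given as name -> sequence."""
    upper = [s.upper() for s in seqs.values()]
    lengths = [len(s) for s in upper]
    total = sum(lengths)
    n_count = sum(s.count("N") for s in upper)
    gc_count = sum(s.count("G") + s.count("C") for s in upper)
    non_n = total - n_count
    return {
        "seq_number": len(upper),
        "total_length": total,
        # Assumption: GC content is reported over non-N bases.
        "gc_content_pct": 100.0 * gc_count / non_n if non_n else 0.0,
        "gap_rate_pct": 100.0 * n_count / total if total else 0.0,
        "n50_length": nx_length(lengths, 0.5),
        "n90_length": nx_length(lengths, 0.9),
        "total_length_non": non_n,
    }
```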

4.3.2 Mapping Results

The processed clean reads are mapped to the reference genome with minimap2. The output BAM file is then sorted and merged with SAMtools. The sorted BAM file is used to calculate mapping statistics; sequencing depth and coverage are summarized in the following table:

Table 5.2 Mapping rate, coverage and depth statistics

  • Sample: Sample names.
  • Clean Reads: The number of reads passed QC.
  • Mapped Reads: The number of reads mapped to the reference genome.
  • Mapping Rate: The ratio of bases mapped to the reference genome.
  • Mean Depth: Average sequencing depth.
  • 1X Coverage: The ratio of reference bases covered at a depth of at least 1X to the total length of the genome.
  • 4X Coverage: The ratio of reference bases covered at a depth of at least 4X to the total length of the genome.
  • 10X Coverage: The ratio of reference bases covered at a depth of at least 10X to the total length of the genome.
  • 20X Coverage: The ratio of reference bases covered at a depth of at least 20X to the total length of the genome.
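The depth and coverage columns can be illustrated with a short calculation over per-base depths. This is a minimal sketch, assuming the depth at every reference position (including positions with zero depth) is available as a sequence of integers; the function name and output keys are hypothetical, and the mean is taken over all positions, which is an assumption about how Mean Depth is defined in Table 5.2.

```python
from typing import Dict, Sequence, Tuple


def depth_summary(depths: Sequence[int],
                  thresholds: Tuple[int, ...] = (1, 4, 10, 20)) -> Dict[str, float]:
    """Return the mean depth and, for each threshold t, the fraction of
    positions whose depth is at least t."""
    n = len(depths)
    if n == 0:
        raise ValueError("depths must contain at least one position")
    summary = {"mean_depth": sum(depths) / n}
    for t in thresholds:
        covered = sum(1 for d in depths if d >= t)
        summary[f"{t}X_coverage"] = covered / n
    return summary
```

For example, ten positions with depths `[0, 1, 4, 10, 20, 25, 3, 0, 12, 5]` give a mean depth of 8.0 and a 1X coverage of 0.8, since two positions are uncovered.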

4.3.3 Sequencing Depth & Coverage Distribution

Figure 5.1 Summary of mapping statistics of each chromosome.

The horizontal axis represents the chromosomes. The mean sequencing depth of each chromosome is indicated by the height of its bar (left vertical axis), while the fraction of covered bases on each chromosome is indicated by the scatter plot (right vertical axis).

Results are stored in the following path: Result/02.Mapping

4.4 Structural Variation (SV) Calling Results

Structural variations (SVs) are genomic variations of relatively large size (>50 bp), including deletions, duplications, insertions, inversions and translocations. SVs can be a source of individual differences and of differing disease susceptibility among species. Detecting large genomic variations has proven challenging with short-read methods. Long-read approaches, such as sequencing on PacBio or Nanopore platforms, produce continuous reads that span large events and have shown promise for dramatically expanding the ability to call structural variation. Novogene employs Sniffles for SV detection from PacBio HiFi or Nanopore data. Sniffles is a structural variation detection tool that maintains a good balance between accuracy and resolution and therefore provides reliable SV detection results.

4.4.1 SV Statistics

Third-generation sequencing provides long reads of high accuracy that can span highly repetitive and complex regions. As a result, the sensitivity and accuracy of SV calling from third-generation data are generally much higher than from next-generation sequencing data.

Table 6.1 Summary of SV statistics

  • Sample: Sample names.
  • Upstream: Variant overlaps 1-kb region upstream of transcription start site.
  • UTR5: Variant overlaps a 5' untranslated region.
  • UTR3: Variant overlaps a 3' untranslated region.
  • UTR5/UTR3: Variant overlaps both a 5' and a 3' untranslated region.
  • Exonic: Variant overlaps a coding region.
  • Unknowns: Variant with unknown function.
  • ncRNA: Variant overlaps a non-coding RNA.
  • Intronic: Variant overlaps intronic region.
  • Splicing: Variant overlaps intronic region and at most 5 bp away from the boundary of exon and intron.
  • Downstream: Variant overlaps the 1-kb region downstream of the transcription end site.
  • Upstream/Downstream: Variant overlaps one gene's upstream region and another gene's downstream region at the same time.
  • Intergenic: Variant overlaps intergenic region.
  • Others: Other types of SV.
  • INS: Insertion.
  • DEL: Deletion.
  • INV: Inversion.
  • BND: Breakend (breakpoint of a translocation-type rearrangement).
  • DUP: Duplication.
  • Total: The total number of SVs.

4.4.2 Summary of SV Statistics

Figure 6.1 SV length distribution

The vertical axis represents the length of detected SVs, and the horizontal axis represents the ratio of SVs with certain length.

Figure 6.2 SV type overview.

The horizontal axis represents sample name, and the vertical axis represents the ratio of different types of SV.

Figure 6.3 Summary of SV locations.

This figure depicts the proportion of SVs located in each type of genomic region.

4.4.3 SV Annotation

Table 6.2 SV annotation.

The table above is only a preview of the full results, which can be found in the directory: Result/03.Variants/SV

Annotation notes:

1.Chr: chromosome number.

2.Start: variant start site.

3.End: variant end site.

4.Ref: sequence of reference genome, the N in reference genome is ignored.

5.Alt: altered sequence; for BND, DEL, INV and other types the value is 0.

6.GeneName: list of gene names involved in altered region.

7.Func: annotation of the region overlapped by the variant (exonic, splicing, UTR5, UTR3, intronic, ncRNA_exonic, ncRNA_intronic, ncRNA_UTR3, ncRNA_UTR5, ncRNA_splicing, upstream, downstream, intergenic). Notes: 1) exonic includes coding, UTR3 and UTR5; 2) when a variant is located in a region with multiple functions, the annotation is listed in order of functional significance: exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic. "UTR5,UTR3" indicates that the variant overlaps a region that is the UTR5 of one gene and the UTR3 of a second gene at the same time. "upstream,downstream" indicates that the variant overlaps a region that is the upstream region of one gene and the downstream region of a second gene at the same time.

8.Gene: The transcript name(s). If a variant has 'intergenic' in 'Func' field, this field will give two neighboring transcripts. If a variant hits multiple transcripts with different functional categories, only transcript names in accordance with the value of 'Func' field will be output. For example, rs333970 hits the exonic, splicing, intronic, exonic of the four transcripts of gene CSF1, the 'Func' value will be 'exonic; splicing' and the 'Gene' value will be 'NM_000757, NM_172210, NM_172212' (NM_172211 will be ignored).

9.GeneDetail: description of variant impact on the transcript. Note: once the variant overlaps intergenic region, the value of "dist" represents the distance between variant and nearby gene.

10.ExonicFunc: functional effect of SNVs and InDels (SNV effects include synonymous_SNV, missense_SNV, stopgain, stoploss and unknown; InDel effects include frameshift insertion, frameshift deletion, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion and unknown).

11.AAChange: amino acid change caused by variant.

12.Otherinfo1: Genotype. Homozygous: 0/0 corresponds to 0, and 1/1 corresponds to 1; heterozygous: 0/1 corresponds to 0.5.

13.Otherinfo2: QUAL in the VCF file, the Phred-scaled probability that the site has no variant, computed as Phred = -10 * log10(1 - p), where p is the probability that the variant exists. The higher the value, the more likely the site is a true variant.

14.Otherinfo3: '.' .

15.Otherinfo4: SV name detected by caller.

16.Otherinfo5: if BND, sequence of the reference genome.

17.Otherinfo6: pos of reference genome.

18.Otherinfo7: SV description, INFO in the VCF file.

19.Otherinfo8: FORMAT in the VCF file; different parameters are separated by ":". GT: genotype; 0: no alteration detected; 1, 2, 3 indicate that the detected allele differs from the reference allele. Homozygous: 0/0, 1/1; heterozygous: 0/1. GQ: conditional genotype quality. DR: number of high-quality reference reads. DV: number of high-quality variant reads.

20.Otherinfo9: The FORMAT value corresponding to Otherinfo8.
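The Otherinfo fields described in notes 12, 13, 19 and 20 can be decoded programmatically. The sketch below is illustrative only: the function names are hypothetical, the example values are made up, and treating every non-"0" allele as an alternate allele when deriving the 0 / 0.5 / 1 code is an assumption that generalizes the mapping given in note 12.

```python
from typing import Dict, Optional


def qual_to_probability(qual: float) -> float:
    """Invert Phred = -10 * log10(1 - p) to recover p, the probability
    that the variant exists."""
    return 1.0 - 10 ** (-qual / 10.0)


def parse_format(format_keys: str, sample_value: str) -> Dict[str, str]:
    """Pair the colon-separated FORMAT keys (Otherinfo8) with the
    corresponding sample values (Otherinfo9)."""
    keys = format_keys.split(":")
    values = sample_value.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT keys and sample values differ in length")
    return dict(zip(keys, values))


def genotype_code(gt: str) -> Optional[float]:
    """Map a GT string to the code used in Otherinfo1:
    0/0 -> 0.0, 0/1 -> 0.5, 1/1 -> 1.0. Returns None for missing calls."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    # Assumption: any allele other than "0" counts as an alternate allele.
    alt = sum(1 for allele in alleles if allele != "0")
    return alt / len(alleles)
```

For instance, `parse_format("GT:GQ:DR:DV", "0/1:35:12:9")` yields a dictionary whose `GT` entry, passed to `genotype_code`, gives 0.5, and a QUAL of 20 corresponds to p = 0.99.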

5 Advanced Analysis

5.1 Results of CNV Calling

Copy number variation (CNV) refers to a circumstance in which the number of copies of a specific segment of DNA varies among the genomes of different individuals. Individual variants may be short or may span thousands of bases. These structural differences may arise through duplications, deletions or other changes and can affect long stretches of DNA. Such regions may or may not contain one or more genes. Novogene uses CNVkit to call genome-wide CNVs. CNVkit implements a pipeline for CNV detection that takes advantage of both on- and off-target sequencing reads and applies a series of corrections to improve the accuracy of copy number calling.

Table 7.1 Statistics of CNV calling result

  • Sample: Sample name.
  • DEL: Number of deletions.
  • DUP: Number of duplications.
  • Total_CNV_nums: Total CNV count.
  • Total_CNV_len: Total CNV length.
  • mean_CNV_len: Average CNV length.

Results recorded in the following path: Result/03.Variants/CNV

5.2 Visualization of Variants by Circos Plot

The plot below is generated by Circos and visualizes the distribution of variants (SVs) across the whole genome.

Figure 8.1 Map of variants across the whole genome.

The outermost circle shows the position coordinates of the genome sequence. From outside to inside, the inner circles show the distribution density of each structural variation (SV) type, in order: insertion (INS), deletion (DEL), inversion (INV), duplication (DUP) and translocation (BND).

Results are stored in the following path: Result/03.Variants/Circos

6 References

[1] Deamer, David, Mark Akeson, and Daniel Branton. 2016. “Three Decades of Nanopore Sequencing.” Nature Biotechnology 34 (5): 518–24.

[2] Talevich E, Shain AH, Botton T, Bastian BC. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Computational Biology. 2016, 12(4): e1004873. (CNVkit)

[3] Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics, (2021). 37:4572-4574. (minimap2)

[4] Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data[J]. Nucleic acids research, 2010, 38(16): e164-e164. (ANNOVAR)

[5] Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods. 2018 Jun;15(6):461-468. (Sniffles)